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Summary 

Nested sampling is a simulation method for approximating marginal likelihoods proposed by |Skilling| 
([2006 ). We establish that nested sampling has an approximation error that vanishes at the standard Monte 
Carlo rate and that this error is asymptotically Gaussian. We show that the asymptotic variance of the 
nested sampling approximation typically grows linearly with the dimension of the parameter. We discuss 
the applicability and efficiency of nested sampling in realistic problems, and we compare it with two 
current methods for computing marginal likelihood. We propose an extension that avoids resorting to 
Markov chain Monte Carlo to obtain the simulated points. 

Some key words: Central limit theorem; Evidence; Importance sampling; Marginal likelihood; Markov chain Monte 
Carlo; Nested sampling. 



1 . Introduction 

Nested sampling was introduced by |Skillirig] ( |2006| l as a numerical approximation method for integrals 
of the kind 



z = j L{y\e)Ti{e)de, 

when TT is the prior distribution and L{y\ 9) is the likelihood. Those integrals are called evidence in the 
above papers. They naturally occur as marginals in Bayesian testing and model choice Jeffreys) |1939 



[Robertj |2001[ Chapters 5 and 7). Nested sampling has been well received in astronomy and has been 



applied successfully to several cosmological problems, see, for instance, Mukherjee et al. (2006JI, Shaw 



et al. ( 2007| l, and Vegetti & Koopmans, ( ,2009^ , among others. In addition, Murray et al., ( ,2006, ) develop a 



nested sampling algorithm for computing the normalising constant of Potts models. 

The purpose of this paper is to investigate the formal properties of nested sampling. A first effort in that 
direction is ,Evans| ( |2007| ), which shows that nested sampling estimates converge in probability, but calls 
for further work on the rate of convergence and the Umiting distribution. 

Our main result is a central limit theorem for nested sampling estimates, which says that the approx- 
imation error is dominated by a 0(A^^^/^) stochastic term, which has a limiting Gaussian distribution, 
and where is a tuning parameter proportional to the computational effort. We also investigate the im- 
pact of the dimension d of the problem on the performances of the algorithm. In a simple example, we 
show that the asymptotic variance of nested sampling estimates grows linearly with d; this means that the 
computational cost is 0{d^ /rf'), where r/ is the selected error bound. 

One important aspect of nested sampling is that it resorts to simulating points 6i from the prior tt, con- 
strained to 6i having a larger likelihood value than some threshold /. In many cases, the simulated points 
must be generated by Markov chain Monte Carlo sampling. We propose an extension of nested sam- 
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49 pling, based on importance sampling, that introduces enough flexibihty so as to perform the constrained 

50 simulation without resorting to Markov chain Monte Carlo. 

51 Finally, we examine two alternatives to nested sampling for computing evidence, both based on the 

52 output of Markov chain Monte Carlo algorithms. We do not aim at an exhaustive comparison with all 

53 existing methods, see, for instance, |Chen et aL| ( |2000[ ), for a broader review, and restrict our attention to 

54 methods that share the property with nested sampling that the same algorithm provides approximations 

55 of both the posterior distribution and the marginal likelihood, at no extra cost. We provide numerical 

56 comparisons between those methods, since some of the aforementioned papers and Murray's PhD thesis 

57 (2007, University College London), also include numerical comparisons of nested sampling with other 

58 methods for several models. 
59 

60 

61 2. Nested sampling: A description 

2-1. Principle 

We briefly describe the nested sampling algorithm, as introduced by SkilUng ( 2006| l. We use L{6) as a 
short-hand for the likelihood L{y \ 9), omitting the dependence on y. 
Nested sampling is based on the following identity: 

67 /•! 

68 ip{x)dx, (1) 

69 ^° 

70 where is the inverse of the survival function of the random variable L{9), 

72 ^-^ ■.i-^w{L{e)>i}, 

'^^ assuming 9 ^ tt and ip^^ is a decreasing function, which is the case when L is a continuous function and 
TT has a connected support. The representation Z = E'^{L{9)} holds with no restriction on either L or tt. 
Formally, this one-dimensional integral could be approximated by standard quadrature methods, 

76 

77 - 

78 ^ = 2^i.^i-i ~ : (2) 

79 '=1 

80 where ipi = ip{xi), and < < • • • < < xo = 1 is an arbitrary grid over [0, 1]. Function tp is in- 

81 tractable in most cases however, so the ipi's are approximated by an iterative random mechanism: 
82 

83 - Iteration 1: draw independently N points 9i i from the prior tt, determine 9i = arg mini<j<jv L{9i i), 

84 and set = ^(^i). 

85 - Iteration 2: obtain the N current values 02, i, by reproducing the 9i/s, except for 9i that is re- 

86 placed by a draw from the prior distribution tt conditional upon L{9) > ipi; then select 6*2 as 9-2 = 

87 argmini<i<jv ^(6*2,1), and set ip2 = L{92). 

88 - Iterate the above step until a given stopping rule is satisfied, for instance when observing very small 

89 changes in the approximation Z or when reaching the maximal value of L{9) when it is known. 
90 

91 In the above, the values x* = ip^^{ipi) that should be used in the quadrature approximation ^ are 

92 unknown, but they have the following property: ti — ip'^{ipi+i)/(p^^{(pi) — x*^i/x* are independent 

93 beta(A^, 1) variates. Skilling| ( ,2006) proposes two approaches: first, a deterministic scheme, where Xi is 

94 substituted with exp{—i/N) in (|2j, so that log.Ti is the expectation of log ip^^ {ipi); second, a random 

95 scheme, where K parallel streams of random numbers Xi_k, k = 1, . . . , K, are generated from the same 

96 generating process as the x*, Xi+i,k = Xi^kti,k, where ti^k ~ beta(A^, 1). In the latter case, a natural esti- 
mator is: 



logZ = — ^logZfe, Zk = - Xi^k)(Pi ■ 

k=l i=l 
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97 For the sake of brevity, we focus on the deterministic scheme in this paper, and study the estimator (|2]i 

98 and Xi = exp(— i/iV). Furthermore, for K — 1, the random scheme produces more noisy estimates than 

99 the deterministic scheme, but, for large values of K, it may be the opposite, see for instance Fig. 3 in 
100 [Murray et aL] ( [2006t . 



101 

IQ2 2-2. Variations and posterior simulation 

103 Skilhng ( 2006| l indicates that nested sampling provides simulations from the posterior distribution at no 



^{xi-^i - Xi)(pif{9i) (3) 



104 extra cost: "the existing sequence of points 61,62,03, .. . already gives a set of posterior representatives, 

105 provided the i'th is assigned the appropriate importance weight LOiL", where the weight uji is equal to the 

106 difference {xi^i — Xi) and Li is equal to ipi. This can be justified as follows. Consider the computation 

107 of the posterior expectation of a given function / 
108 
109 
110 

111 One can then use a single run of nested sampling to obtain estimates of both the numerator and the 

1 12 denominator, the latter being the evidence Z, estimated by (j2|. The estimator 
113 

114 
115 

L J. 

116 

-y-yrj of the numerator is a noisy version of 

118 i _ 

119 y^(a:,„i - Xi)ipJ{ipi) , 

120 »=i 
121 
122 
123 

124 Lemma 1 . Let f{l) = E^{f{6) \ L{6) = l}for I > 0, then, iff is absolutely continuous, 

126 / ip{x)fMx)}dx^ TT{6)L{6)mde. (4) 

127 Jo J 

128 A proof is provided in Appendix 1. Clearly, the estimate of ^(/) obtained by dividing (j3|l by ([2]) is the 

129 estimate obtained by computing the weighted average mentioned above. We do not discuss further this 

130 aspect of nested sampling, but our convergence results can be extended to such estimates. 
131 

132 

133 3. A CENTRAL LIMIT THEOREM FOR NESTED SAMPLING 



where f{l) — E^{f{6) \ L{6) = I}, the prior expectation of f{6) conditional on L{6) = I. This Riemann 
sum is, following the principle of nested sampling, an estimator of the evidence. 



We decompose the approximation error of nested sampling as follows: 



134 
135 

136 ,A 

137 ^{xi^i ~ Xi)ipi - I (p{x)dx^- I ip{x)dx 



=1 







- Xi)(p{xi) - / ip{x) dx y + ^{xi-i - x^) {ipi - (fiixi)} 
i=i J^ J i=i 



138 
139 
140 
141 

142 The first term is a truncation error, resulting from the feature that the algorithm is run for a finite time. 

143 For simplicity's sake, we assume that the algorithm is stopped at iteration j — [(— log e)iV] , where \x'\ 

144 stands for the smallest integer k such that x < k, so that xj — ex-p{—j/N) < e < Xj^i. More practi- 
cal stopping rules are discussed in S|7] Assuming ip, or equivalently L, bounded from above, the error 

(p{x) dx is exponentially small with respect to the computational effort. 

The second term is a numerical integration error, which, provided ip' is bounded over [e, 1], is of order 
O(iV-i), since x^-i -x^ = 0{N-^). 
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145 The third term is stochastic and is denoted 

146 

147 = ^ 

^-1 



3 

VN = y^^{xi-i - x^) {ip{x*) - f{xi)} , 

148 
149 
150 



where the x*'s are such that ipi — L{9i) = ip{x*), therefore x* = ip (vi)- 
The following theorem characterises the asymptotic behaviour of rjN. 

152 Theorem 1. Provided that Lp is twice continuously-dijferentiable over [e, 1], and that its two first 

153 derivatives are bounded over [e, 1], then N^^'^rjN converges in distribution to a Gaussian distribution 

154 with mean zero and variance 

155 „ 

156 sip'{s)tip'(t)log{s\/ t)dsdt. 

157 Js.te[6.i] 

The stochastic error is of order Op{N^^/'^) and it dominates both other error terms. The proof of this 
theorem relies on the functional central limit theorem and is detailed in Appendix 2. A straightforward 
application of the delta-method shows that the log-scale error, logZ — logZ, has the same asymptotic 
behaviour, but with asymptotic variance V/Z'^. 

162 
163 

4. Properties of the nested sampling algorithm 

165 

4-1. Simulating from a constrained prior 

167 The main difficulty of nested sampling is to simulate 6 from the prior distribution tt subject to the 

168 constraint L{9) > L{6i); exact simulation from this distribution is an intractable problem in many realistic 

169 set-ups. It is at least of the same complexity as a one-dimensional slice sampler, which produces an 

170 uniformly ergodic Markov chain when the likelihood L is bounded but may be slow to converge in other 



171 se ttings (|Roberts & Rosenthal! fl999[ ) 



172 Skilling (2006 1 proposes to sample values of 6 by iterating M Markov chain Monte Carlo steps, using 

173 the truncated prior as the invariant distribution, and a point chosen at random among the — 1 survivors 

174 as the starting point. Since the starting value is already distributed from the invariant distribution, a finite 

175 number M of iterations produces an outcome that is marginally distributed from the correct distribution. 

176 This however introduces correlations between simulated points. We stress that our central limit theorem 

177 applies no longer when simulated points are not independent, and that the consistency of nested sampling 

178 estimates based on Markov chain Monte Carlo is an open problem. A reason why such a theoretical result 

179 seems difficult to establish is that each iteration involves both a different Markov chain Monte Carlo kernel 

180 and a different invariant distribution. 

181 There are settings when implementing a Markov chain Monte Carlo move that leaves the truncated prior 

182 invariant is not straightforward. In those cases, one may instead implement an Markov chain Monte Carlo 

183 move, for instance a random walk Metropolis-Hastings move, with respect to the unconstrained prior, 

184 and subsample only values that satisfy the constraint L{d) > L{9i), but this scheme gets increasingly 

185 inefficient as the constraint moves closer to the highest values of L. More advanced sampling schemes 

186 can be devised that overcome this difficulty, such as the use of a diminishing variance factor in the random 

187 walk. 

188 f|5] we propose an extension of nested sampling based on importance sampling. In some settings, 

189 this may facilitate the design of efficient Markov chain Monte Carlo steps, or even allow for sampling 

190 independently the 6'i's. 
191 

4-2. Impact of dimensionality 

We show in this section that the theoretical performance of nested sampling typically depends on the 
dimension d of the problem as follows: the required number of iterations and the asymptotic variance both 
grow linearly with d. Thus, if a single iteration costs 0{d), the computational cost of nested sampling is 
0{d^ /rf), where 77 denotes a given error level; Murray's PhD thesis also states this result, using a more 
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Lemma 2. In the current setting, if Vd is the asymptotic variance of the nested sampling estima- 
tor with truncation point Sd, there exist constants ci, Ci such that Vd/d < ci for all d > I, and 
liminf<j^+oo Vd/d > ci. 



193 heuristic argument. This result applies to the exact nested algorithm only. In principle, resorting to Markov 

194 chain Monte Carlo might entail some additional curse of dimensionality, but this point seems difficult to 

195 study formally, and will only be briefly investigated in our simulation studies. 

196 Consider the case where, for /c = 1, . . . , d, Q^^'^ - 7V(0, erg), and y^^) | 6'('=) - 7V(6'('=) , cr?) , indepen- 

197 dently in both cases. Set — and ctq = = l/47r, so that Z = \ for all d's. A draw from the con- 

198 strained prior is obtained as follows: simulate < —2^/^ log^ from a truncated x^id) distribution and 

199 ui, . . . ,Ud ^ A/'(0, 1), then set ^^^^ — ruk/{u{ + . . . + u^^/"^. Since Z = 1, we assume that the trunca- 

200 tion point is such that Lp{Q)ed = t ^ 1, r = 10^® say, where (^(0) — 2''/^ is the maximum likelihood 

201 value. Therefore, Sd — r2~'^/^ and the number of iterations required to produce a given truncation error, 

202 that is, j = [(— log e)N~\ , grows linearly in d. To assess the dependence of the asymptotic variance with 

203 respect to d, we state the following lemma, established in Appendix 3. 
204 
205 
206 
207 

208 This lemma is easily generalised to cases where the prior is such that the components are independent 

209 and identically distributed, and the likelihood factorises as L{9) = 0^=1 ^(^^'^^)- conjecture that 

210 Vd/d converges to a finite value in all these situations and that, for more general models, the variance 

211 grows linearly with the actual dimensionality of the problem, as measured for instance in Spiegelhalter 

212 |iFiLl ( |2002) . 
213 
214 

215 5. Nested importance sampling 
216 
217 

2j^g prior with the support of tt included in the support of vr, and let L{9) an instrumental likelihood, namely a 
positive measurable function. We define an importance weight function w{9) such that ti{9)L{9)w{9) = 

220 n{9)L{9). We can approximate Z by nested sampling for the pair (tt , L), that is, by simulating iteratively 

221 from tt constrained to L{9) > I, and by computing the generalised nested sampling estimator 

222 J 

223 V(x,_i-x,)v',u;(0,)- (5) 
224 

225 

226 Ths advantage of this extension is that one can choose (tt, L) so that simulating from tt under the constraint 

227 L{9) > I is easier than simulating from tt under the constraint L{9) > I. For instance, one may choose an 

228 instrumental prior tt such that Markov chain Monte Carlo steps adapted to the instrumental constrained 

229 prior are easier to implement than with respect to the actual constrained prior.In a similar vein, nested 

230 importance sampling facilitates contemplating several priors at once, as one may cornpute the evidence 

231 for each prior by producing the same nested sequence, based on the same pair (tt, L), and by simply 

232 modifying the weight function. 

233 Ultimately, one may choose {tt, L) so that the constrained simulation is performed exactly. For instance, 

234 if TT is a Gaussian J\fd{9, S) distribution with arbitrary hyper-parameters, take 
235 
236 
237 

23g where A is an arbitrary decreasing function. Then 

239 
240 



We introduce an extension of nested sampling based on importance sampling. Let tt{9) an instrumental 



L{9) = X \^{9 - 9f±-\9 - 9)^ , 



ip^w{9,) = L{9,)w{9,) = 7T{9,)L{9,)/n{9,) . 

In this case, the Xi's in (|2]i are error-free: at iteration i, 9i is sampled uniformly over the ellipsoid that con- 
tains exactly exp(— i/iV) prior mass as 9i = q.iCv /\\v\\y^ , where C is the Cholesky lower triangle of E, 
V ~ Nd{Q, Id), and qi is the exp(— i/A^) quantile of a x^{d) distribution. Mukherjee et al. ( ,2006 ) consider 
a nested sampling algorithm where simulated points are generated within an ellipsoid, and accepted if they 
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respect the likelihood constraint, but their algorithm is not based on the importance sampling extension 
described here. 

The nested ellipsoid strategy seems useful in two scenarios. First, assume both the posterior mode and 
the Hessian at the mode are available numerically and tune and S accordingly. In this case, this strat- 
egy should outperform standard importance sampling based on the optimal Gaussian proposal, because 
the nested ellipsoid strategy uses a 0{N^^) quadrature rule on the radial axis, along which the weight 
function varies the most; see j j7-3| for an illustration. Second, assume only the posterior mode is available, 
so one may set 6 to the posterior mode, and set S — rid, where r is an arbitrary, large value. Section 
7-3| indicates that the nested ellipsoid strategy may still perform reasonably in such a scenario. Models 
such that the Hessian at the mode is tedious to compute include in particular Gaussian state space models 
with missing observations ( |Brockweir& Davis| 1996] Chap. 12), Markov modulated Poisson processes 
(!Ryden||1994i, or, more generally, models where the expectation-maximisation algorithm (seCj MacLach- 



lan & Krishnan 1997 1 is the easiest way to compute the posterior mode, although one may use Louis' 
( 1982 1 method for computing the information matrix from the expectation-maximisation output. 



6. Alternative algorithms 
Approximating Z from a posterior sample 
As recalled in \ 2-2 the output of nested sampling can be "recycled" so as to approximate posterior 



6-1. 



quantities. Conversely, one can recycle the output of an Markov chain Monte Carlo algorithm towards 



estimating the evidence, with no or little additional programming effort; see for instance [Gelfand & Dey 
( |1994| l, |Meng & Wong] ( |1996[ ), and |Chen & Shao| ( [T997| . We describe below the solutions used in the 
subsequent comparison with nested sampling, but we do not pretend at an exhaustive coverage of those 
techniques, see Chen et al. (2000) or Han & Carlin (2001 1 for a deeper coverage, nor at using the most 
efficient approach, see Meng & Schilling| ( |2002) ). 



6-2. Approximating Z by a formal reversible jump 

We first recover Gelfand and Dey's (1994) solution of reverse importance sampling by an integrated 
reversible jump, because a natural approach to compute a marginal likelihood is to use a reversible jump 
Markov chain Monte Carlo algorithm (Green 1995). However, this may seem wasteful as it involves sim- 
ulating from several models, while only one is of interest. But we can in theory contemplate a single model 
Ai and still implement reversible jump in the following way. Consider a formal alternative model M.' , for 
instance a fixed distribution like the A/^(0, 1) distribution, w ith prior weight 1 /2 and build a proposal from 
M toM' that moves to M' with probability ( |Green[|l995| pm^M' = {(l/2)g(e')}/{(l/2)7r(6')i(6')} A 
1 and from A^' to with probability pm'^m = {(l/2)7r(6')i(0)}/{(l/2).g(6l)} A 1 , g{B) being an ar- 
bitrary proposal on 9. Were we to actually run this reversible jump Markov chain Monte Carlo algorithm, 
the frequency of visits to Ai would then converge to Z. 

However, the reversible sampler is not needed since, if we run a standard Markov chain Monte Carlo 
algorithm on 9 and compute the probability of moving to M' , the expectation of the ratio g{9) /tt{9)L{9) 
is equal to the inverse of Z: 



E{g{e)/n{e)m] 



g{9) n{e)L{e) 



d9 = \/Z , 



no matter what g{9) is, in the spirit of both [Gelfand & Dey| ( |1994| l and |Bartolucci et al.| ( |2006[ ). 

Obviously, the choice of g{9) impacts on the precision of the approximated Z. Wlien using a kernel 
approximation to -^{9 \ y) based on earlier Markov chain Monte Carlo simulations and considering the 
variance of the resulting estimator, the constraint is opposite to the one found in importance sampling, 
namely that g{9) must have lighter (not fatter) tails than tt{9)L{9) for the approximation 
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to have a finite variance. This means that light tails or finite support kernels, like an Epanechnikov kernel, 
are to be preferred to fatter tails kernels, like th e t k ernel. 

In the experimental comparison reported in { 7-2 we compare Zi with a standard importance sampling 
approximation 



t=i 



where g can also be a non-paramet ric approximation of 7t{9 \ y), this time with heavier tails than 

n{e)m. 



Friihwirth- S chnatter 



" ( 2004 1 uses the same importance function g in both Zi and Z2, and obtain 
results similar to ours, namely that Z2 outperforms Zi . 

6-3. Approximating Z using a mixture representation 

Another approach in the approximation of Z is to design a specific mixture for simulation purposes, 
with density proportional to 

m{e) cx wxTT{e)L{e) + g{e) 

where wi > and g{6) is an arbitrary, fully specified density. Simulating from this mixture has the same 
complexity as simulating from the posterior, the Markov chain Monte Carlo code used to simulate from 
7r(0 I y) can be easily extended by introducing an auxiliary variable 5 that indicates whether or not the 
current simulation is from tt{0 \ y) or from g{6). The t-th iteration of this extension is as follows, where 
/C(6', 9') denotes an arbitrary Markov chain Monte Carlo kernel associated with the posterior 7r(6' | y) oc 

ir{e)L{e): 



1. Take (5^*-' = 1, and (J*^*-* = 2 otherwise, with probability 



2. If (5(*) = 1, generate 6*'*' Markov chain Monte Carlo(6'(*~i\ 6'(*)), else generate 6'(*) g{9) inde- 
pendently from the previous value 6*'*^^'. 

This algorithm is a Gibbs sampler: Step 1 simulates (5^^*^ conditional on 6'^*"^-', while Step 2 simulates 
0*^*^ conditional on 6^*-\ While the average of the J^'^'s converges to uJiZ / {loiZ + 1}, a natural Rao- 
Blackwellisation is to take the average of the expectations of the (5(*^'s, 



T 

t=i I 



since its variance should be smaller. A third estimate is then deduced from this approximation by solving 

wiZa/l^iZa + 1} = ^. 

The use of mixtures in importance sampling in order to improve the stability of the estimators dates 



back at least to Hesterberg ( 1998 ) but, as it occurs, this particular mixture estimator happens to be almost 



identical to the bridge sampling estimator of |Meng & Wong ( 1996[ l. In fact, 

cji7r(6i(*))L(6'(*)) 5(6l(*)) 



1 



^ t^i7r(6'(*))i(6i(*)) + g(6l(*)) / ^ tJi7r(6'(*))L(6l(*)) + 5(6'(*)) 
is the Monte Carlo approximation to the ratio 

E^{oc{e)i:{d)L{y \ d)} I EMd)g{Q)\ 



when using the optimal function aiO) — l/{a;i7r(0)L(0) + (7(0)}. The only difference with Meng & 



Wong ( 1996 1 is that, since 6'^*'''s are simulated from the mixture, they can be recycled for both sums. 
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N = 100, M = l 



N = 100, M=3 



N = 100, M=5 



O O T 



0' 
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N=500, M = l 
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N = 100, M=5 
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Fig. 1. Decentred Gaussian example: Box-plots of the log- 
relative error log Z — log Z versus dimension d for four 
values of (A'^, M), and (lower right) total number of itera- 
tions (x 10'') versus dimension for (A*', M) — (100, 5) 



We modify the Gaussian toy example presented in {4-2 



1. Numerical experiments 
7-1. A decentred Gaussian example 



0{k) 



.,6'('')), where the eiC^^'s are m- 
Af{9^''\ 1) independently, but setting 



dependent and identically distributed from Af{0, 1), and 
all the y^s to 3. To simulate from the prior truncated to L{d) > L{9q), we perform M Gibbs iterations 
with respect to this truncated distribution, with A/ = 1, 3 or 5: the full conditional distribution of 6'^'^^ 
conditional on 9^^\ j ^ k, is a JV{0, 1) distribution that is truncated to the interval [y^'^^ — S^y'^^^ + 8\ 
with 

The nested sampling algorithm is run 20 times for d ~ 10, 20, . . ., 100, and several combinations of 
{N,M): (100,1), (100,3), (100,5), and (500,1). The algorithm is stopped when a new contribution 
{xi-i — Xi)ipi to (|2| becomes smaller than 10~® times the current estimate. Focussing first on iV = 100, 
Fig. [T] exposes the impact of the mixing properties of the Markov chain Monte Carlo step: for M — 1, 
the bias sharply increases with respect to the dimension, while, for M — 3, it remains small for most 
dimensions. Results for M — 3 and AI — 5 are quite similar, except perhaps for d — 100. Using M = 3 
Gibbs steps seems to be sufficient to produce a good approximation of an ideal nested sampling algorithm, 
where points would be independently simulated. Interestingly, if N increases to 500, while keeping M = 
1, then larger eiTors occur for the same computational effort. Thus, a good strategy in this case is to 
increase first M until the distribution of the error stabilises, then to increase N to reduce the Monte Carlo 
error. As expected, the number of iterations linearly increases with the dimension. 

While artificial, this example shows that nested sampling may perform quite well even in large dimen- 
sion problems, provided M is large enough. 
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7-2. A mixture example 



As in 



Friihwirth-Schnatter (2004 1, we consider the example of the posterior distribution on {fi, a) asso- 



ciated with the normal mixture 



yi, 



(6) 



when p is known, for two compelling reasons. First, when a converges to and /i is equal to any of the 
Xi's {1 < i < n), the likelihood diverges, see Fig.|2] This is a priori challenging for exploratory schemes 
such as nested sampling. Second, efficient Markov chain Monte Carlo strategies have been developed for 



mixture models (Diebolt & Robert 1994 Richardson & Green 1997 Celeux et al. 2000 1, but Bayes 



factors are difficult to approximate in this setting. 

We simulate n observations from a A/'(2, (3/2)^) distribution, and then compute the estimates of 
Z introduced above for the model (|6]l. The prior distribution is uniform on (—2,6) x (0 • 001, 16) for 
(/i, log (7^). The prior is arbitrary, but it allows for an easy implementation of nested sampling since the 
constrained simulation can be implemented via a random walk move. 

The two-dimensional nature of the parameter space allows for a numerical integration of L{9), based 
on a Riemann approximation and a grid of 800 x 500 points in the (—2, 6) x (0 • 001, 16) square. This 
approach leads to a stable evaluation of Z that can be taken as the reference against which we can test 
the various methods, since additional evaluations based on a crude Monte Carlo integration using 10^ 
terms and on 'Chib's (1995) produced essentially the same numerical values. The Markov chain Monte 
Carlo algorithm implemented here is the standard completion of Diebolt & Robert ( 1994| l, but it does not 
suffer from the usual label switching deficiency Pasra et aL||2005 1 because ([6|l is identifiable. As shown 
by the Markov chain Monte Carlo sample of size N = 10^ displayed on the left hand side of Fig.|2j the 
exploration of the modal region by the Markov chain Monte Carlo chain is satisfactory. This Markov 
chain Monte Carlo sample is used to compute the non-parametric approximations g that appear in the 
three alternatives of Sj6] For the reverse importance sampling estimate Zi, g is a product of two Gaussian 
kernels with a bandwidth equal to half the default bandwidth of the R function densityO, while, for both 
Z2 and Z3, 5 is a product of two t kernels with a bandwidth equal to twice the default Gaussian ba ndwidth. 

We ran the nested sampling algorithm, with N = 10"^, reproducing the implementation of Skilling 



(2006 1, namely using 10 steps of a random walk in (/i,log(T) constrained by the likelihood boundary, 
based on the contribution of the current value of (/i, cr) to the approximation of Z. The overall number of 
points produced by nested sampling at stopping time is on average close to 10^, which justifies using the 
same number of points for the Markov chain Monte Carlo algorithm. As shown on the right hand side of 
Fig.[2j the nested sampling sequence visits the minor modes of the likelihood surface but it ends up in the 
same central mode as the Markov chain Monte Carlo sequence. All points visited by nested sampling are 
represented without reweighting, which explains for a larger density of points outside the central modal 
region. 

The analysis of this Monte Carlo experiment in Fig.|3]first shows that nested sampling gives approx- 
imately the same numerical value when compared with the three other approaches, exhibiting a slight 
upward bias, but that its variability is higher. The most reliable approach, besides the numerical and raw 
Monte Carlo evaluations which cannot be used in general settings, is the importance sampling solution. 



followed very closely by the mixture approach of ^ 6-3 The reverse importance sampling naturally shows 
a slight upward bias for the smaller values of n and a variability that is very close to both other alternatives, 
especially for larger values of n. 

7-3. A probit example for nested importance sampling 

To implement the nested importance sampling algorithm based on nested ellipsoids, we consider the 
arsenic dataset and a probit model studied in Chapter 5 of |Gelman & Hill, (2006 ). The observations are 
independent Bernoulli variables j/j such that extpr{yi = 1 | Xi) = ^{x[9), where Xj is a vector of d 
covariates, is a vector parameter of size d, and (f> denotes the standard normal distribution function. In 
this particular example, d — 7; more details on the data and the covariates are available on the book's 
web-page (http : / /www . stat . Columbia . edu/ "gelman/ arm/ examples /arsenic). 
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Fig. 2. Mixture example: (left) Markov chain Monte Carlo 
sample plotted on the log-likelihood surface in the {/j,, a) 
space for n = 10 observations from ^ ( right) nested sam- 
pling sequence based on TV = 10'^ starting points for the 
same dataset 



Fig. 3. Mixture model: comparison of the variations of 
nested sampling, reverse importance sampling, importance 
sampling and mixture sampling, relative to a numerical ap- 
proximation of Z (dotted line), based on 150 samples of 
size n = 10, 50, 100 



The probit model we use is model 9a in the R program available at this address: the dependent variable 
indicates whether or not the surveyed individual changed the well she drinks from over the past three 
years, and the seven covariates are an intercept, distance to the nearest safe well (in 100 metres unit), 
education level, log of arsenic level, and cross-effects for these three variables. We assign Afd{0, 10^/d) 
as our prior on 0, and denote 9m the posterior mode, and S„j the inverse of minus twice the Hessian at 
the mode; both quantities are obtained numerically beforehand. 

We run the nested ellipsoid algorithm 50 times, for N ^ 2, 8, 32, 128, and for two sets of hyper- 
parameters corresponding to both scenarios described in fjs] In the first scenario, S) = (0m,2Em)- 
The bottom row of Fig. [4] compares log-errors produced by our method (left), with those of importance 
sampling based on the optimal Gaussian proposal, with mean d„i, variance S„i, and the same number 
of likelihood evaluations, as reported on the x-axis of the right plot. In the second scenario, {9, E) = 
{9„,, 100 /d). The top row of Fig. [5] compares log-errors produced by our method (left) with those of 
importance sampling, based again on the optimal proposal, and the same number of likelihood evaluations. 
The variance of importance sampling estimates based on a Gaussian proposal with hyper-parameters 9 and 
E = lOO/d is higher by several order of magnitudes, and is not reported in the plots. 
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As expected, the first strategy outperforms standard importance sampling, when both methods are sup- 
plied with the same information (mode, Hessian), and the second strategy still does reasonably well com- 
pared to importance sampling based on the optimal Gaussian proposal, although only provided with the 
mode. Results are sufficiently precise that one can afford to compute the evidence for the 2^ possible 
models: the most likely model, with posterior probability 0.81, includes the intercept, the three variables 
mentioned above, distance, arsenic, education, and one cross-effect between distance and education level, 
and the second most likely model, with posterior probability 0.18, is the same model but without the 
cross-effect. 




2 8 32 128 




62 244 971 3880 



Fig. 4. Probit example: Box-plots of (left column) log- 
errors of nested importance sampling estimates, for N = 
2, 8, 32, 128, compared with the log-error of importance 
sampling estimates (right column) based on the optimal 
Gaussian proposal, and the same number of likelihood 
evaluations. Those are reported on the x axis of the right 
column plots. The bottom row corresponds to the first strat- 
egy, based on both mode and Hessian, while the top row 
corresponds to the second strategy, based on mode only. 



8. Discussion 

Nested sampling is thus a valid addition to the Monte Carlo toolbox, with convergence rate 0(iV^^/^), 
and computational cost 0{cfi), where d is the dimension of the problem, which enjoys good performances 
in some applications, for example when the posterior is approximately Gaussian, but which may require 
more iterations to achieve the same precision in certain situations. Therefore, further work on the formal 
and practical assessments of nested sampling convergence would be welcome. For one thing, the con- 
vergence properties of Markov chain Monte Carlo-based nested sampling are unknown and technically 
challenging. Methodologically, efforts are required to design efficient Markov chain Monte Carlo moves 
with respect to the constrained prior. In that and other respects, nested importance sampling may con- 
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529 stitute a useful extension. Ultimately, our comparison between nested sampling and alternatives should 

530 be extended to more diverse examples, in order to get a clearer idea of when nested sampling should be 

531 the method of choice and when it should not. For instance, 'Murray et al. (2006i reports that nested sam- 



537 
538 
539 
540 
541 
542 
543 
544 



532 pling strongly outperforms annealed importance sampling (Neal, ,2001j for Potts models. All the programs 

533 implemented for this paper are available from the authors. 
534 
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594 Proof of Lemma^ 

595 It is sufficient to prove this result for functions / that are real-valued, positive and increasing. First, the 

596 extension to vector-valued functions is trivial, so / is assumed to be real-valued from now on. Second, 

597 the class of functions that satisfy property (|4|i is clearly stable through addition. Since / is absolutely 
continuous, there exist functions /+ and such that /+ is increasing, is decreasing, and / = + 

so we can restrict our attention to increasing functions. Third, absolute continuity implies bounded 
variation, so it always possible to add an arbitrary constant to / to transform it into a positive function. 
Lst -0 : Z — > lf{l), which is a positive, increasing function and denote its inverse by One has: 

603 ^ ^ 1 

604 E^'liPme)}] = pr[i;{L{e)}>l]dl^ ip-^^ dl ^ tP{ip{x)}dx, 

605 Jo Jo Jo 
606 

607 which concludes the proof. 

608 
609 
610 
611 

(yl2 Proof of Theorem 1 



Appendix 2 



613 Let ti — x*^-^/x*, for i = 0, 1, ... As mentioned by Skilling (2006 1, the t/s are independent beta(Af, 1) 

614 variates. Thus, Ui — tf defines a sequence of independent uniform [0, 1] variates. A Taylor expansion of 

615 Tyjv gives: 
616 

617 [cAT] 

77Ar = ^ {x,^l - Xi) {if{x*) - if{Xi)} 



1=1 



620 [cAf] 

621 = S^{x,..i-x^){^'{-\ogXi){\ogx^-\ogx*) + 0{\ogx,~\ogx*f\ 



622 
623 

624 where c—— lege, and — ^{&^^)- Fuithermore 



i-i 

Si = iV(loga;, -logx*) = ^(-1 -logUfc) 

fc=0 
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is a sum of independent, standard variables, as E{logUi) = —1 and var(logUi) = 1. Thus, 
(log Xi — log X*) = Op(iV^^/^), where the implicit constant in Op{N^^^^) does not depend on i, and 



V / e-V'WSiv(-)d4l + Op(iV-i/2)| 



since ip'{t) = 4>'{i/N) +0{N^^) for t £ [{i — 1)/N,i/N], where, again, the implicit constant in 
0{N~^) can be the same for all i, as ip" is bounded, and provided -Bjy(t) i s defined as B^it) = 
{cN)~^/'^S\^cNt\ for i G [0, 1]- According to Donsker's theorem (Kallenberg 2002j p.275), Bj^ converges 
to a Brownian motion B on [0, 1], in the sense that /{Bn) converges m distribution to f{B) for any 
measurable and a.s. continuous function /. Thus 



Jo c 



converges in distribution to 

r e-'^'{t)B{-)dt, 
Jo c 

which has the same distribution as the following zero-mean Gaussian variate: 



^ e-*i^'{t)B{t)dt = Sip' {s)B{- log s)ds. 



Appendix 3 
Proof of Lemma |2] 

For the sake of clarity, we make dependencies on d explicit in this section, including ipd for p, £d for e, 
and so on. We will use repeatedly the facts that ip is nonincreasing and that ip' is nonnegative. One has: 



ie[ed,l] 



Vd(sM(*)log(sVt)dt < -logeJ / sip'ais)ds\ < dlog(2i/2/T) 



for d> 1, since — J^^ s(p'^{s) ds < — sip'j^{s) ds = 1. This gives the first result. 
Let Sd = tp'^^{a'^), for < a < 1; is the probability that 

d 

(47r/d)^ 0^-1 < -21og(a) + log(2) - 1 
1=1 

assuming that the fl^'s are independent M{0, l/in) variates. The left-hand side is an empirical average 
of independent and identically distributed zero-mean variables. We t ake a so that the r ight-hand side is 
negative, which implies a > 2^/^ exp(— 1/2). Using large deviations (Kallenberg 2002 Chapter 27), one 
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673 has — log{sci)/d — > 7 > as — > +00, and 

674 



675 lvd = -l [ Vd(s)i^dW log(s V t) dsdt 



676 ^«,«e[e<i,i] 

677 f-logsd^ ' 

678 - 
679 
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688 
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682 /_logs^\ r /•! ^2 

683 >y 2 )\^~J ^<i^^)'^^~ j 'fd{s)ds + £d.ipd{£d)-Sd^pd{Sd)j ■ 

As ^ +00, - \og{sd)/d -^^,Sd^ 0, ipdisd) = a** 0, ipd{s) As < ipd{sd){'^ - Sd) ^ 0, and 

< / iPd{s) ds - £d^d{£d) < SdWdi'^) - fdiSd)} < r < 1, 

Jo 



by the definition of Sd, and the squared factor is in the hmit greater than or equal to (1 — r)^ 



